Introduction to R is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
The structure of this course is a code-along style: it is 100% hands-on! A few hours prior to each lecture, links to the materials will be available for download on Quercus. The teaching materials consist of an R Markdown Notebook with concepts, comments, instructions, and blank coding spaces that you will fill in with R by coding along with the instructor. Other teaching materials include a live-updating HTML version of the notebook and, when required, datasets to import into R. This learning approach lets you spend your time coding rather than taking notes!
As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark) through DataCamp to help cement and/or extend what you learn each week.
We’ll take a blank slate approach here to R and assume that you pretty much know nothing about programming. From the beginning of this course to the end, we want to take you from some potential scenarios such as…
A pile of data (like an Excel file or tab-separated file) full of experimental observations that you don’t know what to do with.
Maybe you’re manipulating large tables in Excel, making custom formulas and pivot tables with graphs. Now you have to repeat similar experiments and do the analysis all over again.
You’re generating high-throughput data and there aren’t any bioinformaticians around to help you sort it out.
You heard about R and what it could do for your data analysis but don’t know what that means or where to start.
and get you to a point where you can…
Format your data correctly for analysis.
Produce basic plots and perform exploratory analysis.
Make functions and scripts for re-analysing existing or new data sets.
Track your experiments in a digital notebook like R Markdown!
In the first lesson, we will talk about the basic data structures and objects in R, get cozy with the R Markdown Notebook environment, and learn how to get help when you are stuck because everyone gets stuck - a lot! Then you will learn how to get your data in and out of R, how to tidy our data (data wrangling), and then subset and merge data. After that, we will dig into the data and learn how to make basic plots for both exploratory data analysis and publication. We’ll follow that up with data cleaning and string manipulation; this is really the battleground of coding - getting your data into just the right format where you can analyse it more easily. We’ll then spend a lecture digging into the functions available for the statistical analysis of your data. Lastly, we will learn about control flow and how to write customized functions, which can really save you time and help scale up your analyses.
Don’t forget, the structure of the class is a code-along style: it is fully hands on. At the end of each lecture, the complete notes will be made available in a PDF format through the corresponding Quercus module so you don’t have to spend your attention on taking notes.
There is no single correct path from A to B, although some paths may be more elegant or more efficient than others. With that in mind, the emphasis in this lecture series will be on:
tidyverse series of packages. This resource is
well-maintained by a large community of developers. While not always the
“fastest” approach, this additional layer can help ensure your code
still runs (somewhat) smoothly later down the road.
This is the final in a series of seven lectures. Last lecture we explored the realm of statistical analyses with linear regression and other general linear models. Now we arrive at the final destination, addressing how to create looping and branching code, as well as our own functions in the topic of control flow. At the end of this session we will have covered:
Grey background: Command-line code, R library and
function names. Backticks are also used for in-line code.
... fill in the code here if you are coding along
Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green boxes: Recommended reads and resources to learn R
Red boxes: A comprehension question which may or may not involve a coding cell. You usually find these at the end of a section.
Each week, new lesson files will appear within your RStudio folders.
We are pulling from a GitHub repository using this Repository
git-pull link. Simply click on the link and it will take you to the
University of Toronto datatools
Hub. You will need to use your UTORid credentials to complete the
login process. From there you will find each week’s lecture files in the
directory /2024-09-IntroR/Lecture_XX. You will find a
partially coded skeleton.Rmd file as well as all of the
data files necessary to run the week’s lecture.
Alternatively, you can download the R-Markdown Notebook
(.Rmd) and data files from the RStudio server to your
personal computer if you would like to run independently of the Toronto
tools.
A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!
As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF or HTML file under the Modules section of Quercus.
The following datasets used in this week’s class come from a manuscript published in PLoS Pathogens entitled “High-throughput phenotyping of infection by diverse microsporidia species reveals a wild C. elegans strain with opposing resistance and susceptibility traits” by Mok et al., 2023. These datasets focus on an analysis of infection in wild isolate strains of the nematode C. elegans by environmental pathogens known as microsporidia. The authors collected embryo counts from individual animals in the population after population-wide infection by microsporidia, and we have spent the past few classes working with the dataset to learn how to format and manipulate it.
It’s the last time we’ll be working with this dataset that we carefully created. It will help us work through the different aspects of control flow.
We’ll be using this source file later to show how you can save your own functions and import them for data analysis.
The following packages are used in this lesson:
tidyverse (tidyverse installs several packages for
you, like dplyr, readr, readxl,
tibble, and ggplot2). In particular we will be
taking advantage of the stringr package this week.
viridis our colour-blind friendly package for
providing specific colour palettes to our visualizations
Some of these packages should already be installed into your Anaconda
base from previous lectures. If not, please review that lesson and load
these packages. Remember to please install these packages from the
conda-forge channel of Anaconda.
conda install -c conda-forge r-biocmanager
BiocManager::install("limma")
conda install -c conda-forge r-gee
conda install -c conda-forge r-multcomp
#--------- Install packages for today's session ----------#
# install.packages("tidyverse", dependencies = TRUE) # This package should already be installed on RStudio
#--------- Load packages for today's session ----------#
library(tidyverse)
library(viridis)
Don’t repeat code when you can use flow control!
Although we have only briefly touched on some of the aspects regarding control flow, it has been implemented behind the scenes in many of the functions you’ve used throughout this course. From your experience in R Markdown Notebooks, the order in which a code cell’s individual statements or instructions are executed can be considered part of control flow. Expanding on this idea, when you see the number order of the code cells, this also indicates the control flow of the entire notebook or program. Once a code cell is run, the objects it has generated remain stored in memory and available for access.
Within our code cells and overall program, control flow involves statements that make choices, create loops, evaluate conditions, and move us through the program. These specific statements allow us to run different blocks of code at different times. This can be accomplished through
In this lecture, we’ll touch on all of these concepts to give you a taste of how you can make your programs accomplish more with less actual code. Let’s start by loading up an example dataset to play around with.
# set working directory
getwd()
## [1] "C:/Users/mokca/Dropbox/!CAGEF/Course_Materials/Introduction_to_R/2024.09_Intro_to_R/lecture_07_flow_control"
list.files("./data")
## [1] "190423_boxplot.facet.png" "190423_boxplot.facet.saveFunction.png"
## [3] "190423_boxplot.png" "200704_boxplot.facet.makeFunction.png"
## [5] "200704_boxplot.facet.png" "200704_boxplot.facet.saveFunction.png"
## [7] "200704_boxplot.png" "200704_graph.facet.saveFunction.tryCatchv2.png"
## [9] "200704_graph.facet.saveFunction2.png" "200711_boxplot.facet.makeFunction.png"
## [11] "200711_boxplot.facet.png" "200711_boxplot.png"
## [13] "200718_boxplot.facet.makeFunction.png" "200718_boxplot.facet.png"
## [15] "200718_boxplot.png" "200818_boxplot.facet.png"
## [17] "200818_boxplot.png" "200822_boxplot.facet.png"
## [19] "200822_boxplot.png" "200901_boxplot.facet.png"
## [21] "200901_boxplot.png" "200912_boxplot.facet.png"
## [23] "200912_boxplot.png" "200915_boxplot.facet.png"
## [25] "200915_boxplot.png" "221019_graph.facet.saveFunction.tryCatchv2.png"
## [27] "embryo_data_long_merged.csv" "Lecture07.all.RData"
## [29] "Lecture07.R" "Lecture07.Rdata"
## [31] "old"
# read our file in with read_csv()
embryos.df <- ...("...", col_types = "cdffffdfllddfddffff")
## Error in ...("...", col_types = "cdffffdfllddfddffff"): could not find function "..."
# explore our loaded data frame
head(embryos.df)
## Error in eval(expr, envir, enclos): object 'embryos.df' not found
for() loops to repeat commands for a maximum number of iterations
R doesn’t care if you write the same code 1000 times or have the
interpreter repeat a single copy 1000 times. However, the second is a
lot easier for you. The for() loop helps to reduce code
replication by compartmentalizing a set of instructions to repeat
instead of copying and pasting the same code several times.
More specifically, a for() loop executes a statement
repetitively until a well-defined endpoint. In this case, it determines
when a specific variable’s value is no longer contained in a given
sequence.
For example, let’s say that we want to add 2 to a ten
times, overwriting a each time:
# Increment a by 2, the bad way...
a <- 2
a <- a+2
...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
a
## [1] 4
Sure, 10 times is doable by hand, just copy-paste. But what if you
need to perform that same task, say 1,000 times? What if the code was
more complex than a <- a + 2? That is when
for() loops come to the rescue.
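As a minimal sketch of the idea (the counter name `i` and the choice of ten iterations are illustrative assumptions, not the in-class solution), the ten copy-pasted lines can collapse into a single loop:

```r
# Start from the same value as the copy-paste example
a <- 2

# Repeat the increment 10 times; i takes the values 1, 2, ..., 10 in turn
for (i in 1:10) {
  a <- a + 2
}

a  # 2 + (10 * 2) = 22
```

Changing `1:10` to `1:1000` is all it would take to repeat the same step 1,000 times.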
# Increment 'anything' using a for loop
anything = 24
# Set up your for loop with a variable named 'tally'
...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
tally
## function (x, wt = NULL, sort = FALSE, name = NULL)
## {
## UseMethod("tally")
## }
## <bytecode: 0x000001ac5880ef30>
## <environment: namespace:dplyr>
anything
## [1] 24
for loop can be described in three stages
for(x in y): Set a variable x to equal the next value in a given sequence y
{ code to run }: Run a set of code which may use the variable x at its assigned value in the cycle
There are a number of ways to set the counting variable within the
for() initialization. In reality, you just need to supply a
vector of elements for it to iterate through. This could be a sequence
where y is defined as a:b, or a numeric
vector, or even a vector of strings! Each of these is assigned to
x in our loop and must be used appropriately.
Note that without {...} enclosing your
code, R will run only the first statement right after the
for() call. This can exist on the same line, or on the next
line. Subsequent lines, regardless of indentation, will not be
run as part of the loop. This behaviour lets you quickly build
a simple for() loop or you can extend the behaviour to
accomplish many or more complex tasks.
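The effect of leaving out the braces can be sketched with two small counters. The variable names `with_braces` and `without_braces` are made up for this illustration:

```r
# With braces: both statements belong to the loop body
with_braces <- 0
for (i in 1:3) {
  with_braces <- with_braces + 1
  with_braces <- with_braces + 1
}
with_braces  # 6: two increments on each of 3 iterations

# Without braces: only the first statement after for() is repeated
without_braces <- 0
for (i in 1:3)
  without_braces <- without_braces + 1
  without_braces <- without_braces + 1  # indented, but runs once, after the loop
without_braces  # 4: three increments in the loop, plus one afterwards
```

Because indentation carries no meaning in R, wrapping the body in braces is usually the safer habit, even for one-line loops.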
Let’s take a look at the seq() function and how you can
use it within a for() loop.
# Use the seq() function
seq(from = 1, to = 10, by = 0.5)
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0
# let's use seq() in a for loop to count, no braces but indentation
for(... in seq(...))
print(variable)
## Error in seq(...): '...' used in an incorrect context
print("middle but not really")
## [1] "middle but not really"
print("This is the end")
## [1] "This is the end"
# for loop on a single line
for(variable in seq(1, 10, 0.5)) print(variable)... print("middle but not really"); print("This is the end")
## Error: <text>:2:49: unexpected symbol
## 1: # for loop on a single line
## 2: for(variable in seq(1, 10, 0.5)) print(variable)...
## ^
# for loop on a single line, with brackets
for(variable in seq(1, 10, 0.5)) ...; print("This is the end")
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
## [1] "This is the end"
for() loops
As was mentioned at the start of this section, under the hood, many
of the functions that we commonly use are just for() loops.
We can easily replicate them with explicit for loops but it takes up
extra coding time! For example, we can replicate the rep()
function.
# Use the rep() function to print the number 1-5, 8 times
rep(x = 1:5, times = 8)
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
Let’s duplicate the function of rep() with a
for() loop!
# for loop version variables need to be set
rm(result, i) # Remove the variables result and i, if they exist.
## Warning in rm(result, i): object 'result' not found
## Warning in rm(result, i): object 'i' not found
x <- ...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
n <- ...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
result <- x # What happens if we remove this line?
## Error in eval(expr, envir, enclos): object 'x' not found
# Build our for loop
for (i in 1:(...)){
result <- c(result, x)
print(result)
}
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
result
## Error in eval(expr, envir, enclos): object 'result' not found
i
## Error in eval(expr, envir, enclos): object 'i' not found
Why did we declare result <- x ahead of the for loop?
It can get a little complicated but for our purposes, we can say that
the issue lies within the for loop itself, in the statement
result <- c(result, x). Remember, when the kernel
encounters this command, it tries to evaluate the right side of the
assignment first. When it goes to look for result it does
not exist and cannot complete the assignment. To help facilitate this,
we need to declare result outside the loop.
There are a few ways we could do this such as with
result <- NULL just so that it
exists as an initialized placeholder. Instead
we assigned it initially to hold the first iteration of our sequence.
Either would have worked but would require different numbers of loop
iterations.
If you declared result <- NULL or
result <- x within the loop, R would repeat this
command with every iteration, thus resetting result
to its initial state each time. Nothing would accumulate! We’ll
use this concept to springboard us into the idea of scope.
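To make the two initialization choices concrete, here is one possible sketch. It assumes `x <- 1:5` and `n <- 8` to mirror the `rep()` call above, and the names `result_from_null` and `result_from_x` are invented for the comparison:

```r
x <- 1:5
n <- 8

# Option 1: start from an empty placeholder, then append x on all n iterations
result_from_null <- NULL
for (i in 1:n) {
  result_from_null <- c(result_from_null, x)
}

# Option 2: start with the first copy of x, so only n - 1 iterations remain
result_from_x <- x
for (i in 1:(n - 1)) {
  result_from_x <- c(result_from_x, x)
}

# Both versions should match the built-in function
identical(result_from_null, rep(x, times = n))  # TRUE
identical(result_from_x, rep(x, times = n))     # TRUE
```

Growing a vector with `c()` inside a loop is fine for small examples like this one, though it can become slow for very long loops.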
Control flow statements as with other compartmentalized sections of code can be thought of as separate rooms in a house or sandboxes in a playground.
Thus a variable is either global or local in scope. If it is local,
then the information about it simply disappears at the end of the
function that created it. In many programming languages, the scope of a variable can
be considered as between the {...} of a programming section:
after you’ve left that section, anything explicitly declared within (i.e.
new variables from that section) is released from memory.
R doesn’t exactly play by those rules. Only functions create a new scope, so
variables created inside a for loop or an if() statement remain in the
surrounding environment after the block ends. If you want to ensure that variables from something
like a for loop remain local, you can use the local()
command or create a function().
Lexical and Dynamic scoping: Going even deeper, R and other programming languages implement what are known as dynamic and lexical scoping. When you create functions within functions they can inherit variables based on the context of their creation. This can affect the behaviour of functions when they are used later within your programs. You can find more information on the rules of dynamic and lexical scoping here.
Why is scope important?
Understanding this concept will save you a lot of troubles down the road as you make more and more complex programs. You’ll learn to avoid declaring variables in the wrong place, or trying to access ones that no longer exist in your scope. Let’s revisit our example from above.
# Clear some memory and check the value of variables that may already exist
rm(result, j)
## Warning in rm(result, j): object 'result' not found
## Warning in rm(result, j): object 'j' not found
cat("The prior value of i is :", i, "\n")
## Error in eval(expr, envir, enclos): object 'i' not found
# for loop version variables need to be set
x <- 1:5
n <- 8
result <- 100
# Build a local for loop - this completely isolates any new variables from the global scope
...(
for (i in 1:n){
result <- c(result, x)
print(result)
j <- ... # assign a value to j
}
)
## Error in ...(for (i in 1:n) {: could not find function "..."
cat("The value of result is: ", result, "\n")
## The value of result is: 100
cat("The value of i is :", i, "\n")
## Error in eval(expr, envir, enclos): object 'i' not found
cat("The value of j is :", j)
## Error in eval(expr, envir, enclos): object 'j' not found
local() scope isolates your code from the global environment
What happened to our variable result? You can see that
it was initially declared as the value of 100. When we
entered the local() scope and then had the first iteration
of our for() loop the code
result <- c(result,x) looked locally first for the
values of result and x but these variables did
not exist so it pulled the values from the global environment.
Subsequently a local result variable was then declared and
assigned a value. This local version of result was updated
with each iteration but the global version was
never altered. Similarly, within the
local() scope, the values of i were assigned
to a new version of i within the function and never
overwrote the original values of i in the main part of the
code cell.
A similar effect is seen when creating and using your own functions (to be discussed) but you can see that the kernel searches for variables (and functions) in the local namespace before checking the global namespace, followed by the namespaces of the loaded packages.
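The lookup order described above can be sketched with a few throwaway variables. The names `inner_only`, `k`, and `last_seen` are hypothetical and exist only for this illustration:

```r
result <- 100

local({
  # The right-hand side finds no local result, so it reads the global value;
  # the assignment then creates a separate, local result
  result <- c(result, 1:3)
  inner_only <- "made inside local()"
  print(result)  # 100 1 2 3
})

result                # still 100: the global copy was never changed
exists("inner_only")  # FALSE: the local variable is gone

# A bare for loop, by contrast, does not create its own scope
for (k in 1:3) {
  last_seen <- k
}
k          # 3: the loop variable is still available
last_seen  # 3: so is anything assigned inside the loop
```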
for() loop
The most useful thing to do with a for loop is to cycle through
values. Let’s return to embryos.df and plot the total
embryos for each observation across each infection date. As a twist
we’ll add each infection date one at a time using a loop until we get to
the final version of our visualization.
# Pull down the structure and colnames of our embryos.df
str(embryos.df, give.attr = FALSE)
## Error in eval(expr, envir, enclos): object 'embryos.df' not found
...(embryos.df)
## Error in ...(embryos.df): could not find function "..."
# Grab a list of infection dates from the dataset
days = ...(embryos.df$infectionDate)
## Error in ...(embryos.df$infectionDate): could not find function "..."
for (i in 1:...) {
plot <-
embryos.df %>%
filter(infectionDate %in% ...) %>%
ggplot(.) +
# 2. Aesthetics
aes(x = infectionDate, y = embryos, colour = infectionDate) +
labs(title = paste0("Embryos per infection date with ", i, " days")) + # Add a title based on the day range
guides(colour = "none") +
# 4. Geoms
geom_jitter()
suppressWarnings(print(plot)) # Drop the warnings when we print the plot
# Sys.sleep(2) # Pause the system for 2 seconds
}
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
for() loop
Another handy feature of the for() loop in R is being
able to directly give the loop a vector to iterate through until there
are no elements left. This will come in handy when applying the same
transformations, functions, or calculations on different subsets or
elements within a vector.
We’ll start with a simple example of looping through a small character vector.
# for loop over a character vector, with braces
for(variable in c(...)) {
print(variable)
print("scream;")
}
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
print ("for ice cream")
## [1] "for ice cream"
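Looping directly over the elements of a vector, as described above, might look like the sketch below. The vector `samples` and its values are made up for illustration. When you also need the position of each element, looping over `seq_along()` is a common alternative:

```r
samples <- c("A", "B", "C")  # illustrative values only

# Loop over the elements themselves
labels <- character(0)
for (s in samples) {
  labels <- c(labels, paste0("sample_", s))
}
labels  # "sample_A" "sample_B" "sample_C"

# Loop over positions when the index is needed as well
for (idx in seq_along(samples)) {
  cat(idx, ":", samples[idx], "\n")
}
```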
Let’s use a t.test() to look for embryo production
differences between N2 and JU1400 animals when infected at medium dose
levels by the microsporidia LUAm1. We’ll use a for loop to
gather this information across all days.
# Build a very specific subset of data looking at only N2 and JU1400 populations
# infected by a medium dose of LUAm1
subdata <-
embryos.df %>%
filter(... %in% c("N2", "JU1400"),
... == "LUAm1",
... == "Medium"
)
## Error in eval(expr, envir, enclos): object 'embryos.df' not found
# create an empty data frame to store the output of the for loop
result <- ...(infectionDate = unique(subdata$infectionDate),
difference = NA,
p_value = NA)
## Error in ...(infectionDate = unique(subdata$infectionDate), difference = NA, : could not find function "..."
result
## [1] 100
# for loop to calculate difference in means between N2 and JU1400 infected by LUAm1 on the same date
for(i in ...) {
# Generate a t-test on subset by day
t <- t.test(embryos ~ wormStrain, subdata[subdata$infectionDate == i, ])
# write the results to our data frame
result[result$infectionDate == i, "difference"] <- diff(t$estimate)
result[result$infectionDate == i, "p_value"] <- t$p.value
}
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
result
## [1] 100
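The pattern of pre-allocating a results table and filling it one group at a time can be sketched on a small, made-up data frame. Everything in `toy` (the column names `day` and `strain`, and all of the counts) is invented for illustration and is not taken from the lecture dataset:

```r
# Toy data: two days, two strains, three animals each (all values made up)
toy <- data.frame(
  day     = rep(c("d1", "d2"), each = 6),
  strain  = rep(rep(c("A", "B"), each = 3), times = 2),
  embryos = c(10, 12, 14, 20, 22, 24, 5, 6, 7, 5, 7, 9)
)

# Pre-allocate one row per day to hold the output of the loop
days <- unique(toy$day)
results <- data.frame(day = days, difference = NA, p_value = NA)

for (d in days) {
  # Welch t-test on the rows belonging to the current day only
  tt <- t.test(embryos ~ strain, data = toy[toy$day == d, ])
  # diff() of the two group means gives mean(B) - mean(A)
  results[results$day == d, "difference"] <- diff(tt$estimate)
  results[results$day == d, "p_value"] <- tt$p.value
}

results
```

Creating `results` before the loop means each iteration only fills in its own row, rather than rebuilding the table every time.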
for loops run beneath the group_by()
If the code from above seems familiar in idea, you might recognize
that we are simply breaking the data into subgroups and performing a
t.test on it.
We’ve seen this kind of paradigm before using the
group_by() function in conjunction with
summarise(). Using a call to group_by() we can
make groups based on infectionDate; passing these along
to summarise() then produces the calculations we want on
each subgroup. In this case, the code is slightly cleaner and simplified
compared to the for loop.
subdata_ttest <-
# Pass the subdata
subdata %>%
# Group the data
... %>%
# Use Summarise to do the repetitive work for you
summarise(difference = diff(t.test(...)$estimate),
p_values = t.test(...)$p.value)
## Error in ...(.): could not find function "..."
subdata_ttest
## Error in eval(expr, envir, enclos): object 'subdata_ttest' not found
if() statements
Conditional branching only runs code when criteria have been met!
One of the big advantages of programming is to have conditional
statements in your code. R can make binary decisions like “if data meets
a condition, do this”. Some of these happen implicitly as in a
for() loop (ie keep repeating the code until you run out of
input) but you can also declare these decision branches
explicitly.
The if() (conditional argument) evaluates statements
that produce a single TRUE or FALSE result. The general
format is
if (boolean expression) {
# statement(s) will execute if the boolean expression is true.
}
Let’s give it a try on a simple example.
# Practice with an if() statement
x <- c("what", "is", "truth")
if(...) {
print("Truth is found")
}
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
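The requirement for a single TRUE or FALSE can be sketched as follows. The conditions shown are just two possibilities, and the flag `found` is a hypothetical helper:

```r
x <- c("what", "is", "truth")

# %in% with a single value on the left returns exactly one TRUE or FALSE
found <- FALSE
if ("truth" %in% x) {
  found <- TRUE
  print("Truth is found")
}

# A comparison against the whole vector returns one value per element...
x == "truth"  # FALSE FALSE  TRUE

# ...so collapse it with any() (or all()) before handing it to if()
if (any(x == "truth")) {
  print("Truth is found (via any())")
}
```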
else statement
Now that we know how to use if() statements, what if we
want to give a second instruction based on the outcome of the
if() statement? The else and
else if statements exist to extend the conditional branch
through additional considerations. In general, the structure looks like
this:
if (boolean_expression_1) {
# statement(s) will execute if boolean_expression_1 is TRUE.
} else if (boolean_expression_2) {
# statement(s) will execute if boolean_expression_1 was FALSE and boolean_expression_2 is TRUE.
} else {
# statement(s) will execute if none of the above boolean expressions were TRUE.
}
You can include any number of else if statements in
the middle of the flow control but you should end with only a single
else statement or none at all. Remember, the
else statement is a catch-all, last resort to deal with
any unexpected scenarios.
# Practice with a complex if() statement
x <- c("what", "is", "truth")
# Build a complex cascade of statements looking for Truth
if("TRUTH" %in% x) {
print("TRUTH is found")
} ... ("Truth" %in% x) {
print ("Truth is found")
} ... { # notice the placement of else is directly after the closing }
print ("The truth is out there somewhere")
}
## Error: <text>:7:3: unexpected symbol
## 6: print("TRUTH is found")
## 7: } ...
## ^
Remember that the if/else statements will
cascade through! Therefore, with proper ordering of your expressions,
you can simplify them as we see below with a grade assignment
conditional branching statement.
# Pick a student grade
grade <- 69
letterGrade <- "Unassigned"
# Long if statement for choosing grades
if (grade >= 90) { letterGrade <- "A+"
} else if (grade >= 85) { letterGrade <- "A"
} else if (grade >= 80) { letterGrade <- "A-"
} else if (grade >= 77) { letterGrade <- "B+"
} ... (grade >= 73) { letterGrade <- "B"
} ... (grade >= 70) { letterGrade <- "B-"
} ... {letterGrade <- "FZ"}
# What is the assigned letter grade?
letterGrade
## Error: <text>:10:3: unexpected symbol
## 9: } else if (grade >= 77) { letterGrade <- "B+"
## 10: } ...
## ^
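Why ordering matters in a cascade can be sketched with a made-up example. The variable `temperature` and its cut-offs are hypothetical and unrelated to the grading scheme above:

```r
temperature <- 35  # hypothetical reading in degrees Celsius

# Strictest test first: each branch only needs a lower bound, because
# reaching it means every earlier (higher) test has already failed
if (temperature >= 30) {
  category <- "hot"
} else if (temperature >= 20) {
  category <- "warm"
} else if (temperature >= 10) {
  category <- "mild"
} else {
  category <- "cold"
}
category  # "hot"

# Loosest test first: it captures every value >= 10, so the later
# branches can never run
if (temperature >= 10) {
  misordered <- "mild"
} else if (temperature >= 20) {
  misordered <- "warm"
} else if (temperature >= 30) {
  misordered <- "hot"
} else {
  misordered <- "cold"
}
misordered  # "mild", which is not what we wanted for 35 degrees
```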
if() statements can be nested
Sometimes you may have a series of branching criteria that
you want met or you want to perform a series of additional checks after
a first level of criteria are met. In that case you may wish to use a
series of nested if() statements. Let’s take a
look at the code cell below for an example of nesting if()
statements.
# Only proceed if you have data in numVector
numVector <- c(1,2,3,4)
if(...) {
# You have data to look at so print it
print(numVector)
if(...) {
# Your vector has a minimum sum
print(sum(numVector))
if(...) {
# Your vector has a product of less than 20
print(prod(numVector))
} else {
print("product is >= 20")
}
} else {
print("sum is <= 8 ")
}
} else {
print("vector is empty")
}
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
if() statements to generate system messages
If/else statements can also be used to perform system-wide tasks, like generating a warning or halting the code. For example, if we are writing a file to a directory and there is already a file with the same name, we should generate a warning or simply stop. Without the warning, the existing file will be silently overwritten.
# Check if our file exists
# Use dir() to return a vector of file names and then ask if any match ours.
if(...) {
print("Stop! A file with that same name already exists")
} else {
# The file does not exist, print the go-ahead and save the file
print("No files with the same name. Good to go!")
}
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
write_csv(x = subdata_ttest, file = "embryo_subdata_ttest.csv", col_names = TRUE)
## Error in eval(expr, envir, enclos): object 'subdata_ttest' not found
Challenge: Is there a cleaner way to produce our conditional?
Despite the warning output generated by our code, the file in our
example would still be overwritten. The call
to write_csv() is outside the
control flow of the conditional if()/else.
To fulfill our true intentions, we should move the placement of the
write_csv() function so that it is under the direct
influence of the control flow.
# Check if our file exists
# Use dir() to return a vector of file names and then ask if any match ours.
if(...) {
print("Stop! A file with that same name already exists")
} else {
# The file does not exist, print the go-ahead and save the file
print("No files with the same name. Good to go!")
# Write the file as part of the same control statement
write_csv(x = subdata_ttest, file = "embryo_subdata_ttest.csv", col_names = TRUE)
}
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
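One possible way to express this check uses base R's `file.exists()`. In this sketch the output path is a temporary file and `demo_df` holds made-up values, so nothing in your working directory is touched. Base `write.csv()` stands in for `write_csv()` only to keep the example free of package dependencies:

```r
# Hypothetical output path inside the session's temporary directory
out_file <- tempfile(pattern = "embryo_subdata_ttest_", fileext = ".csv")
demo_df <- data.frame(difference = c(1.5, -0.3), p_value = c(0.04, 0.61))  # made-up values

if (file.exists(out_file)) {
  message("Stop! A file with that same name already exists: ", out_file)
} else {
  # The write happens inside the else branch, so it is skipped when the file exists
  message("No files with the same name. Good to go!")
  write.csv(demo_df, file = out_file, row.names = FALSE)
}

# Running the same check again would now take the first branch
file.exists(out_file)  # TRUE
```

If you would rather halt the script entirely than print a message, `stop()` can be used in the first branch instead of `message()`.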
if() and else statement is an
effective control flow statement for simple tasks
As we’ve seen a couple of times in lecture now, rather than making a
large control flow block for simple tasks, we can use the
if() or ifelse() commands as a way to contain
all of our conditional statements and commands in one function.
The if() else syntax can take the simple form
of:
if (conditional_expression) TRUE_result else FALSE_result
The conditional_expression used in our statement must
evaluate to a single TRUE or
FALSE. If this requirement is not met, an
error will be produced: this is the case for NA and, since R 4.2.0, also for a
logical vector longer than one element (earlier versions only gave a warning).
The results from the above syntax may also be assigned to a variable to use later. Let’s look at the following code cells for more examples.
# Use if when x is TRUE
x <- ...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
if(x) "True result"
## Error in if (x) "True result": the condition has length > 1
# Use if when x is FALSE
x <- ...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
if(x) "False result"
## Error in if (x) "False result": the condition has length > 1
# Use if when x is NA
x <- ...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
if(x) "NA result"
## Error in if (x) "NA result": the condition has length > 1
# You can make complex logical expressions as long as they evaluate to either TRUE or FALSE!
x <- TRUE
y <- FALSE
z <- if(...) "At least one variable was TRUE!!!"
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
z
## Error in eval(expr, envir, enclos): object 'z' not found
# Use if else when x is TRUE
x <- TRUE
if(x) "True result" else "..."
## [1] "True result"
# Use if else when x is FALSE
x <- FALSE
if(x) "True result" else "..."
## [1] "..."
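Assigning the result of a one-line `if`/`else` to a variable can be sketched like this. The names `n_reads`, `status`, `missing_branch`, and `flag` are hypothetical:

```r
n_reads <- 12  # hypothetical count

# The whole if/else expression returns a value, which we can store
status <- if (n_reads > 10) "enough data" else "too few"
status  # "enough data"

# Without an else, a FALSE condition returns NULL
missing_branch <- if (n_reads > 100) "huge"
is.null(missing_branch)  # TRUE

# An NA condition is an error, so wrap it in isTRUE() when NA is possible
flag <- NA
safe_result <- if (isTRUE(flag)) "flag is TRUE" else "flag is FALSE or NA"
safe_result  # "flag is FALSE or NA"
```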
ifelse() statement allows vectorized conditional assignment
Like the above if() statement, this allows us to assign
branched output without building the full branching structure. However,
as we alluded to in lecture 6, this is a much more
powerful command than it appears to be as you can supply a set of
vectors to this function to produce a vector of results!
ifelse(test = boolean_expression_vector,
yes = true_outcome_vector/true_outcome_action,
no = false_outcome_vector/false_outcome_action)
Watch out for vector recycling! It’s convenient for re-assigning values across vectors but note that we aren’t performing any complex actions or response - just assigning outcomes/values based on our evaluation expression.
# A simple example of ifelse()
rm(a)
i <- 8
ifelse(test = i < 5, yes = a <- 0, no = a <- 1)
## [1] 1
a
## [1] 1
# A complex vectorized example of ifelse()
i <- ...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
ifelse(test = i < 5, yes = 0, no = 1) # Can we achieve this in a simpler way?
## [1] 1
# Don't forget that we can quickly convert booleans to numeric!
...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
If you are looking for more ways to do this kind of general vectorised “if” assignment, you can look into the dplyr::case_when() function which will allow multiple conditionals and specific assignment outcomes.
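A small sketch of the vectorised behaviour of `ifelse()` described above, using an illustrative vector `i` (the values are assumptions chosen for the example):

```r
i <- c(2, 8, 4, 10)  # illustrative values

# One test per element, one outcome per element
coded <- ifelse(test = i < 5, yes = 0, no = 1)
coded  # 0 1 0 1

# The same 0/1 coding from converting a logical vector to numeric
as.numeric(i >= 5)  # 0 1 0 1

# yes and no can themselves be vectors; each position picks from the matching element
doubled_or_negated <- ifelse(i < 5, i * 2, -i)
doubled_or_negated  # 4 -8 8 -10
```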
switch()
There are a lot of simple cases where one can imagine a series of
possible character input values and corresponding output
values.
For instance, when examining a specific series of categories or
character values, we can definitely create a complex and rather long
if/else/else if series of statements. We can however,
replace that long series of code with a more compact version where we
simply identify the case/assignment pairings.
In programming we call these switch or case statements. Let’s look at an example below.
# Pick your favourite pokemon!
pokemon <- "squirtle"
dexType <- "unknown"
# Look at how long this listing gets
if (pokemon == "bulbasaur") {dexType <- "plant"
} else if (pokemon == "squirtle") {dexType <- "water"
} else if (pokemon == "charmander") {dexType <- "fire"
} else if (pokemon == "pikachu") {dexType <- "electric"
} else if (pokemon == "lapras") {dexType <- "water/ice"
} else if (pokemon == "snorlax") {dexType <- "normal"
} else if (pokemon == "magikarp") {dexType <- "water"
} else {dexType <- "unknown input"}
dexType
## [1] "water"
As we can see above, things can get long and complicated for assigning values with an if statement.
# Pick your favourite pokemon!
pokemon <- "lapras"
dexType <-
...(...,
"bulbasaur" = "plant",
"squirtle" = "water",
"charmander" = "fire",
"pikachu" = "electric",
"lapras" = "water/ice",
"snorlax" = "normal",
"magikarp" = "water",
"unknown input")
## Error in ...(..., bulbasaur = "plant", squirtle = "water", charmander = "fire", : could not find function "..."
dexType
## [1] "water"
Switch cases or hashmaps? We’ve just spent a while working through longer examples of switch/case statements. There are some advantages and disadvantages to programming with this method. On the one hand, you can quickly make some assignments for specific cases; on the other hand, this is still cumbersome to deal with when your list of cases becomes long. In the above examples we wouldn’t want to program a whole pokedex like this but rather would use the power of dataframes or other data structures like a hashmap to help us manage the mapping of data from one piece of information to another. Remember, use the right data structure for the right circumstances!
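One lightweight way to get hashmap-like behaviour in base R is a named vector, where the names act as keys. The sketch below is only an illustration: dex and the helper lookup.type() are hypothetical names, and a data frame or an environment may suit larger mappings better.

```r
# Named character vector used as a lookup table (names are keys, elements are values)
dex <- c(bulbasaur = "plant", squirtle = "water", charmander = "fire",
         pikachu = "electric", lapras = "water/ice")

# Hypothetical helper: look up a key, falling back to a default when the key is absent
lookup.type <- function(key, lookup = dex, default = "unknown input") {
  if (key %in% names(lookup)) lookup[[key]] else default
}

lookup.type("lapras")
lookup.type("mewtwo")

# Unlike switch(), indexing by name works on several keys at once
unname(dex[c("squirtle", "pikachu")])
```

Adding a new case is then a matter of adding one element to the vector rather than editing control flow.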
There may be instances where you need to run loops on data until you find a certain piece of information, or until a specific condition is met rather than examining all of the elements within a set. There are two ways you can accomplish these “open-ended” loops.
while() loops run conditionally
Unlike for() loops, which continue to execute until
a specific iteration number is reached, the while() loop executes a
command as long as a conditional expression continues to evaluate as
TRUE at each iteration. This conditional expression must
evaluate as TRUE to begin execution as well. The
while() loop can be thought of as a special implementation
of an if() statement that repeats over and over again until
the conditional fails.
Let’s work with some simple examples.
# Initialize our variable for conditional assessment
x <- 0
# Generate the while loop, incrementing x by 1 on each iteration, as long as x < 10
while(...) {
x <- x + 1
print(x)
}
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Loop will be ignored if the condition is FALSE and nothing gets printed
x <- ...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
while(x < 10) {
x <- x + 1
print(x)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
When programming a conditional loop you must always include a statement that alters the condition or breaks out of the loop itself. It’s also important to note where in the loop you alter the condition. All the command statements within the loop, unless otherwise specified, will execute before the conditional statement is re-evaluated.
For example, a programmer is assigned a task: “While you’re at the grocery store, buy some eggs”. The programmer never came back home.
# Set your initial value
programmer <- " at the grocery store"
# Build your while loop
while(programmer == " at the grocery store") {
print("buy some eggs")
programmer <- ... # What would happen if we commented out this line?
}
## [1] "buy some eggs"
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
print(programmer)
## [1] " at the grocery store"
# When do we provide the opportunity to change?
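To make the point about placement concrete, here is a small sketch that runs the same while() condition twice and differs only in where the counter is updated. The vectors seen.a and seen.b are hypothetical names used to record what each version would have printed.

```r
# Variant A: update the counter first, then use it
x <- 0
seen.a <- c()
while (x < 3) {
  x <- x + 1
  seen.a <- c(seen.a, x)
}

# Variant B: use the counter first, then update it
x <- 0
seen.b <- c()
while (x < 3) {
  seen.b <- c(seen.b, x)
  x <- x + 1
}

seen.a  # 1 2 3
seen.b  # 0 1 2
```

Both loops run three times, but the values seen inside the loop body are shifted by one because the condition is only checked at the top of each iteration.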
next and break to exit any kind of looping structure
The explicit use of the next and break
commands will interrupt the normal flow of the current looping structure but each
differs in what it does afterwards.
The next command will exit the current iteration in
the loop structure but will return to run the next iteration of
the loop.
Use this to skip over or avoid specific commands within your loop.
The break command will completely exit the loop
structure, as if it had reached its natural end.
Use this to permanently exit your looping structure.
Let’s use the following examples to see how these mechanisms work.
# using next within our for loop
for(i in 1:10) {
if (i >= 5 & i <= 8) {
... # ends the current iteration and skips to the next one
}
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
i
## [1] 5
# Using break
for(i in 1:10) {
if (i == 5) {
... # completely exits the loop
}
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
i
## [1] 5
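The two commands can also be combined in one loop. The following is a minimal sketch with made-up conditions (skip even numbers, stop at the first odd number above 7); kept is a hypothetical vector used to record which iterations reach the end of the loop body.

```r
kept <- c()
for (i in 1:10) {
  if (i %% 2 == 0) next   # even number: skip the rest of this iteration
  if (i > 7) break        # first odd number above 7: leave the loop entirely
  kept <- c(kept, i)
}

kept  # 1 3 5 7
i     # 9, the value of i at the moment break was called
```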
repeat loops run endlessly unless specifically interrupted by break
Unlike the while loop, which can end when its conditional is no longer
met, a repeat loop has no explicit conditional statement
built into its formation. Instead, it will continue to repeat until it
is broken out of by the break command.
# Using repeat() to endlessly loop
i = 1
repeat {
if (i == 20) {
break # completely exits the loop
}
print(i)
i = i + 1
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
## [1] 12
## [1] 13
## [1] 14
## [1] 15
## [1] 16
## [1] 17
## [1] 18
## [1] 19
i
## [1] 20
Depending on the order in which you set up your
conditionals, you may accidentally produce unexpected issues. It is best
to consider the order in which you want to accomplish tasks within your
loops before beginning the next iteration. This is especially relevant
in the case of a conditional loop (while() or
repeat) where you must include a variable that can
eventually meet the desired conditions for exit.
Loops don’t care about you! Although loops and other control flow structures can vastly simplify our code, you must remember they are agnostic to your intentions. These structures have a very specific purpose and design, so to program successfully with these, we need to understand their inner workings.
Take the time to visually and mentally test your code using a series of base cases by asking yourself what input and output should look like: before the first iteration, after the first iteration, in the middle of your dataset, in your penultimate iteration, in your final iteration. Quickly assessing these on a small test set can also help you identify potential problems!
# Using repeat() to demonstrate that conditional placement matters.
i = 1
# What numbers will this code print?
# What happens if we move the print command around?
repeat {
...
if (i == 20) {
break # completely exits the loop
}
print(i)
}
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
Depending on the task you are working on, perhaps there is already a function
that satisfies your need so you don’t have to use explicit
for() loops. Make use of existing functions whenever you
can because those have already been optimized to be fast and
efficient.
Taking advantage of functions can allow you to keep your code clean rather than programming for loops to generate a simple number pattern.
Use R’s vectorized functions: Many of the base R functions we’ve seen over the span of this course work well on vectors. In fact these functions are optimized to work on these data structures and you should take advantage of this. Often completing the same operations in something like a for() loop can take much longer. While not apparent on small datasets, you can begin to see the consequences of your choices on much larger ones. Here are some resources that highlight this efficient option.
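As a small illustration of the same calculation done both ways, the sketch below squares the numbers 1 to 10 with an explicit for() loop and with a single vectorized expression. The variable names are hypothetical; on a vector this small the speed difference is not noticeable, so the point here is only that the results agree and the vectorized form is shorter.

```r
n <- 1:10

# Explicit loop: pre-allocate the result, then fill it element by element
squares.loop <- numeric(length(n))
for (i in seq_along(n)) {
  squares.loop[i] <- n[i]^2
}

# Vectorized: the ^ operator works on the whole vector at once
squares.vec <- n^2

identical(squares.loop, squares.vec)  # TRUE
sum(squares.vec)                      # 385
```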
Comprehension Question 1.0.0: In your own words describe the difference between a while loop and a for loop. What is the purpose of one over the other?
for() loops with ggplot()
Let’s say we are ready to start making some plots for our
manuscript, and we want to make individual plots for each
infectionDate (replicate). The code below makes a boxplot
for each worm strain from the 190423 replicate of our data.
ggplot(embryos.df[embryos.df$infectionDate == "190423",]) +
#2 Aesthetics
aes(x = ..., y = ..., fill = wormStrain) +
# 4. Geoms
geom_boxplot()
## Error in eval(expr, envir, enclos): object 'embryos.df' not found
But what if I were to have, say, multiple infection dates? In this case, a for loop will be the way to go. Take a look at the following code:
# Loop through the possible infection dates
for (...) {
infectionRep <-
ggplot(embryos.df[embryos.df$infectionDate == i,]) +
#2 Aesthetics
aes(x = wormStrain, y = embryos, fill = wormStrain) +
theme_grey() +
ggtitle("Embryo counts") + # plot title
# 4. Geoms
geom_boxplot()
print(infectionRep) # This is the only way to view the plot in a for loop
# Save each plot as it's generated
ggsave(plot = infectionRep, filename = paste(i, "boxplot.png", sep = "_"), path = "data/" ,
scale=1, device = "png", units = c("in"))
}
## Error: <text>:2:9: unexpected ')'
## 1: # Loop through the possible infection dates
## 2: for (...)
## ^
From above you can see that we can take advantage of our variables
that increment within the for loop. We can use it to help subset data,
generate titles, and file names. You can use it in combination with
other control statements to update the image as well! Just remember to
avoid generating errors within your for() loop when accessing
or altering data. Ensure you aren’t trying to reference or alter data or
subsets that do not exist due to missing information in your original
datasets.
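One way to guard against missing subsets is to check the number of rows before doing any work in the loop. The sketch below uses a tiny made-up data frame (toy.df) and a deliberately absent date, both hypothetical, and only builds file names where the plotting and saving calls would normally go.

```r
# Hypothetical toy data; the real embryos.df has more columns
toy.df <- data.frame(infectionDate = c("190423", "190423", "200704"),
                     embryos = c(10, 12, 7))

# Dates we would like to process; "999999" is deliberately absent from the data
wanted.dates <- c("190423", "200704", "999999")

processed <- c()
for (i in wanted.dates) {
  subset.df <- toy.df[toy.df$infectionDate == i, ]
  if (nrow(subset.df) == 0) {
    message("No rows for infection date ", i, "; skipping")
    next
  }
  # Plotting and saving calls would go here
  processed <- c(processed, paste(i, "boxplot.png", sep = "_"))
}

processed
```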
What if I want to facet our data for each infection date across
sporeStrain and doseLevel?
# Loop through the possible infection dates
for (i in unique(embryos.df$infectionDate)) {
infectionRep <-
ggplot(embryos.df[embryos.df$infectionDate == i,]) +
# 2. Aesthetics
aes(x = wormStrain, y = embryos, fill = wormStrain) +
theme_grey() +
# We'll need to rotate the x-axis text if we keep the figure size about 12" wide
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
ggtitle(paste0("Embryo counts on infection date: ", i)) + # plot title
# 4. Geoms
geom_boxplot() +
# 6. Facets
facet_grid(...) ### 2.1.0 Facet our data by strain and dose
# Only print the 190423 dataset
if (i == "190423") {
print(infectionRep) # The only way to see the plot is to print it within a for loop
}
# Save each plot as it's generated
ggsave(plot = infectionRep, filename = paste(i, "boxplot.facet.png", sep = "_"), path = "data/" ,
scale=1, device = "png", width = 12, height = 7, units = c("in"))
}
## Error in eval(expr, envir, enclos): object 'embryos.df' not found
Yes! So far we’ve covered many options for control flow but all of our programs have been moving in a linear direction from start to end. That is also just a consequence of working with a Markdown notebook. Programs, however, are not necessarily run in a linear fashion.
What if you need to perform a set of similar instructions multiple
times, at multiple points within your control flow? Perhaps it’s even
the same kind of for() loop on different sets of data?
There are a lot of tricks like nested loops but you’re better off
knowing how to make functions that can be used in other code as
well!
The general structure of a script or program can be divided into
Global/environmental variables and declarations
Describe your script and assumptions
Import your libraries
Declare any global variables
Main program
Helper functions or subroutines
A best practice when writing functions is the “Do One Thing” principle: each function should do one thing; one task. Instead of a big function, you can write several small ones per task, without going to the other extreme which would be fragmenting your code into a ridiculous amount of small code snippets/functions. By doing the one (main) thing, your functions become:
Time to start writing our own functions!
While we have been using help() and ? to
look up documentation on the various functions we’ve been using, our
user-defined functions will not have any kind of accessible
documentation. Of course if we were making specific packages for R we
could create accessible
documentation.
Regardless of this problem, it is best practice to document your functions much like you document the rest of your code. In this case you can include information such as:
function()
In R, a function is declared with the following syntax:
function_name = function(parameter1_name, parameter2_name, ..., parameterN_name = preset_value) {
# The specific code of your function goes within the {...}
return(output)
}
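Before moving to the plotting code, here is a minimal sketch of that syntax in action. The function std.error() is a hypothetical helper, not something defined elsewhere in the course; it is documented in comments in the same spirit as the longer functions in this section.

```r
# Description: compute the standard error of the mean
# Input:
#   x: a numeric vector without missing values
# Output: a single numeric value, sd(x) / sqrt(n)
std.error <- function(x) {
  n <- length(x)
  return(sd(x) / sqrt(n))
}

# Call the function like any built-in one
std.error(c(2, 4, 4, 4, 5, 5, 7, 9))
```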
Let’s convert our plotting code from above into a simple function!
# Description: This function, given a set of data from the embryo.df format will produce
# a faceted series of box plots for a specific infection date, organized by sporeStrain and doseLevel
# Input:
# data.df: a data frame at least with the following column names
# $infectionDate, $wormStrain, $sporeStrain, $doseLevel
# infDate: a character string used to subset the data
# Output: make.facet.plot will generate a facet plot from data.df based on the infDate variable
# The plot will be saved to a file ending in "boxplot.facet.function.png"
make.facet.plot = function(...) {
infectionRep <-
# You could also filter your data with filter() and piping instead!
ggplot(data.df[data.df$infectionDate == infDate,]) +
# 2. Aesthetics
aes(x = wormStrain, y = embryos, fill = wormStrain) +
theme_grey() +
# We'll need to rotate the x-axis text if we keep the figure size about 12" wide
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
ggtitle(paste0("Embryo counts on infection date: ", infDate)) + # plot title
# 4. Geoms
geom_boxplot() +
# 6. Facets
facet_grid(sporeStrain ~ doseLevel)
# Print the plot
print(infectionRep)
# Save each plot as it's generated
ggsave(plot = infectionRep,
filename = paste(infDate, "boxplot.facet.makeFunction.png", sep = "_"),
path = "data/" ,
scale=1, device = "png", width = 12, height = 7, units = c("in"))
} # End of make.facet.plot
Now that our subroutine is stored in memory, it can be called whenever we
want! You can even use it on different data sets as long as they meet the
requirements set out in our description of the function itself. You can
also build upon it, using control flow to decide whether the plot will be faceted
or not. The code between the two versions is so similar, you could break
it into an if statement.
Call your functions from anywhere once they are stored in memory.
Let’s try to use it right now.
unique(embryos.df$infectionDate)
## Error in eval(expr, envir, enclos): object 'embryos.df' not found
# Use a for loop to iterate through the first 3 levels of infectionDate
for (i in unique(embryos.df$infectionDate)[1:3]){
# Call on our function now
make.facet.plot(...)
}
## Error in eval(expr, envir, enclos): object 'embryos.df' not found
return() command
Some of your functions may generate subsets of data or results that you would like to further investigate for analysis. For example, when we generate our plots, perhaps we would like to also retrieve information like where the file was saved, along with the subset of data for each.
Using the return() command has two consequences:
It will terminate or exit the function currently running once this command is called.
It will return a single object that will be assigned to a variable or be displayed to the console if unassigned.
A special note about the returned object. This can be any kind of
object and if you want to return multiple objects, put them in
a list! Let’s update our function.
# Description: This function, given a set of data from the embryo.df format will produce
# a faceted series of box plots for a specific infection date, organized by sporeStrain and doseLevel
# Input:
# data.df: a data frame at least with the following column names
# $infectionDate, $wormStrain, $sporeStrain, $doseLevel
# infDate: a character string used to subset the data
# Output: make.facet.plot will generate a facet plot from data.df based on the infDate variable
# The plot will be saved to a file ending in "boxplot.facet.function.png"
# It will return
# [1] subset data
# [2] ggplot object
# [3] save plot filename
save.facet.plot = function(data.df, infDate) {
### 3.2.0 We've updated the plot to use a filter() function!
infection.data <- data.df %>% filter(infectionDate == ...)
infectionPlot <-
ggplot(infection.data) +
# 2. Aesthetics
aes(x = wormStrain, y = embryos, fill = wormStrain) +
theme_grey() +
# We'll need to rotate the x-axis text if we keep the figure size about 12" wide
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
ggtitle(paste0("Embryo counts on infection date: ", infDate)) + # plot title
# 4. Geoms
geom_boxplot() +
# 6. Facets
facet_grid(sporeStrain ~ doseLevel)
# Save the name of the plot file
save.file = paste(infDate, "boxplot.facet.saveFunction.png", sep = "_")
# Save each plot as it's generated, reusing save.file so the returned name matches the file on disk
ggsave(plot = infectionPlot,
filename = save.file,
path = "data/" ,
scale=1, device = "png", width = 12, height = 7, units = c("in"))
### 3.2.0 return the file name and data subset
# Create a list so that you can send multiple objects back as a single object
return(list(infection.data, infectionPlot, save.file))
}
# Call on save.facet.plot function now
inf200704.plot <- save.facet.plot(embryos.df, "...")
## Error in eval(expr, envir, enclos): object 'embryos.df' not found
# Look at the data
head(inf200704.plot...)
## Error in eval(expr, envir, enclos): object 'inf200704.plot...' not found
# Display the plot to output
inf200704.plot[[2]]
## Error in eval(expr, envir, enclos): object 'inf200704.plot' not found
# What's the file name?
inf200704.plot[[3]]
## Error in eval(expr, envir, enclos): object 'inf200704.plot' not found
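Positional indices such as [[2]] and [[3]] are easy to mix up. One option, sketched below with a hypothetical helper called summarize.counts(), is to name the list elements so they can be retrieved by name as well as by position.

```r
# Hypothetical helper returning several results as a single named list
summarize.counts <- function(counts, label) {
  result <- list(data = counts,
                 mean.count = mean(counts),
                 file.name = paste(label, "summary.txt", sep = "_"))
  return(result)
}

res <- summarize.counts(c(4, 6, 8), "190423")

res$mean.count       # access by name
res[["file.name"]]   # also by name
res[[3]]             # positional access still works
```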
Treat your functions like a black box! Now that you know how to build a function and return data from it, you should consider that it is best practice to treat your functions like a black box. What does that really mean? Your functions should stand alone and be independent. If you were to copy one from one program or notebook to another, it should still work (for the most part). That means your functions should never have to rely on variables that exist outside of the function itself.
If you need to pass information to the function like a variable, dataframe, list etc - do this through the arguments! Whenever you need to return information, then return it as part of a list if needed. The function should be agnostic or independent of the world around it. Some assumptions can be made like loading preset libraries outside the function, but you can even do that within your functions!
The last helpful part of making functions is to consider providing default values for some of your arguments. In some cases you may have a subset of datasets that need to be treated differently so including an argument for your function to toggle certain behaviours is helpful. Including these arguments, however, means you have to define them every time you call on the function unless you assign a default value.
Default values are only overridden by supplied arguments, otherwise these will be applied within your function.
Before we update our save.facet.plot() let’s see what
happens if we simply forget to include a parameter.
# Rerun our function without an infection date
save.facet.plot(embryos.df)
## Error in eval(expr, envir, enclos): object 'embryos.df' not found
As you can see, our user-defined function throws an error when we
neglect to provide an argument for the infDate parameter.
Let’s update the save.facet.plot() function by setting the
infDate parameter to a known date “190423”. This could
easily be something different like setting a logical parameter to
default to TRUE or FALSE, which could change
internal behaviours of the function itself.
# Description: This function, given a set of data from the embryo.df format will produce
# a faceted series of box plots for a specific infection date, organized by sporeStrain and doseLevel
# Input:
# data.df: a data frame at least with the following column names
# $infectionDate, $wormStrain, $sporeStrain, $doseLevel
# infDate: a character string used to subset the data
# Output: make.facet.plot will generate a facet plot from data.df based on the infDate variable
# The plot will be saved to a file ending in "boxplot.facet.function.png"
# It will return
# [1] subset data
# [2] ggplot object
# [3] save plot filename
### 3.3.0 Set the default value of infDate to 190423
save.facet.plot = function(data.df, infDate = "...") {
# We've updated the plot to use a filter() function!
infection.data <- data.df %>% filter(infectionDate == infDate)
infectionPlot <-
ggplot(infection.data) +
# 2. Aesthetics
aes(x = wormStrain, y = embryos, fill = wormStrain) +
theme_grey() +
# We'll need to rotate the x-axis text if we keep the figure size about 12" wide
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
ggtitle(paste0("Embryo counts on infection date: ", infDate)) + # plot title
# 4. Geoms
geom_boxplot() +
# 6. Facets
facet_grid(sporeStrain ~ doseLevel)
# Save the name of the plot file
save.file = paste(infDate, "boxplot.facet.saveFunction.png", sep = "_")
# Save each plot as it's generated, reusing save.file so the returned name matches the file on disk
ggsave(plot = infectionPlot,
filename = save.file,
path = "data/" ,
scale=1, device = "png", width = 12, height = 7, units = c("in"))
#return the file name and data subset
# Create a list so that you can send multiple objects back as a single object
return(list(infection.data, infectionPlot, save.file))
}
# Rerun our function without an infection date
save.facet.plot(embryos.df)
## Error in eval(expr, envir, enclos): object 'embryos.df' not found
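The same idea works for logical toggles and other settings. The sketch below uses a hypothetical function, describe.counts(), with two defaulted parameters: calls that omit them fall back to the defaults, and calls that name them override only what they name.

```r
# Hypothetical helper with two default values
describe.counts <- function(counts, digits = 1, as.text = FALSE) {
  value <- round(mean(counts), digits)
  if (as.text) {
    return(paste("Mean embryo count:", value))
  }
  return(value)
}

describe.counts(c(3, 4, 6))                  # uses digits = 1, as.text = FALSE
describe.counts(c(3, 4, 6), digits = 2)      # overrides digits only
describe.counts(c(3, 4, 6), as.text = TRUE)  # overrides as.text only
```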
While a rarer occurrence, your user-defined functions can be used to instantiate and return a function itself. In these cases, the scoping of your variables can become a little trickier but variables within your code can be set using parameters from the initial function.
Let’s start with a simple example before we return to our plot-saving function.
# Define our function(s)
make.power = function(...) { # This sets the variable values (via lexical scoping) of the exponent
pow = function(...) { # When we call on the resulting function it will require a base value
base^power # Make the actual calculation
}
}
# Define a new function that does cubic calculations
cube = make.power(...)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Now we have a function cube() that takes a parameter called base to calculate base^power
# Call on our cubic function using a base of 4
cube(...)
## Error in cube(...): could not find function "cube"
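For a second look at the same pattern with different names, the sketch below builds functions that multiply by a fixed amount. make.scaler(), to.percent() and to.permille() are hypothetical; force() is included as a common precaution so the argument is evaluated when the inner function is created.

```r
# The outer function fixes the multiplier; the inner function remembers it
make.scaler <- function(multiplier) {
  force(multiplier)   # evaluate the argument now rather than lazily
  function(x) {
    x * multiplier
  }
}

# Two functions built from the same factory, each with its own multiplier
to.percent <- make.scaler(100)
to.permille <- make.scaler(1000)

to.percent(0.25)    # 25
to.permille(0.25)   # 250
```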
Now let’s revisit our plot-saving function. We’ll make a new
plot-setting function that we can use to permanently set the data frame
that is used when making plots. We can initialize this newly set
function and save it as the function set.facet.plot().
# Description: This function, given a set of data from the embryo.df format will produce
# a faceted series of box plots for a specific infection date, organized by sporeStrain and doseLevel
# Input:
# data.df: a data frame at least with the following column names
# $infectionDate, $wormStrain, $sporeStrain, $doseLevel
# infDate: a character string used to subset the data
# Output: make.facet.plot will generate a facet plot from data.df based on the infDate variable
# The plot will be saved to a file ending in "boxplot.facet.function.png"
# It will return
# [1] subset data
# [2] ggplot object
# [3] save plot filename
### 3.4.0 Define a new function where we set the data.df parameter as input.
set.facet.plot = function(...) {
# Set the default value of infDate to 190423
save.facet.plot = function(infDate = "190423") {
# We've updated the plot to use a filter() function!
infection.data <- data.df %>% filter(infectionDate == infDate)
infectionPlot <-
ggplot(infection.data) +
# 2. Aesthetics
aes(x = wormStrain, y = embryos, fill = wormStrain) +
theme_grey() +
# We'll need to rotate the x-axis text if we keep the figure size about 12" wide
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
ggtitle(paste0("Embryo counts on infection date: ", infDate)) + # plot title
# 4. Geoms
geom_boxplot() +
# 6. Facets
facet_grid(sporeStrain ~ doseLevel)
# Save the name of the plot file
save.file = paste(infDate, "boxplot.facet.saveFunction.png", sep = "_")
# Save each plot as it's generated, reusing save.file so the returned name matches the file on disk
ggsave(plot = infectionPlot,
filename = save.file,
path = "data/" ,
scale=1, device = "png", width = 12, height = 7, units = c("in"))
#return the file name and data subset
return(list(infection.data, infectionPlot, save.file))
}
}
# Step 2. Make a function where the data set is embryos.df
make.embryo.plot <- set.facet.plot(...)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Make a plot and filter it by infection Date
infection.results <- make.embryo.plot("...")
## Error in make.embryo.plot("..."): could not find function "make.embryo.plot"
infection.results[[2]]
## Error in eval(expr, envir, enclos): object 'infection.results' not found
stop() function exits a function with a message
Sometimes you might produce a function that could fail at a number of
points for various reasons. While the R-kernel may simply produce a
warning and proceed, you may wish to stop the function wherever it is
rather than proceeding. Using the stop() function can help
produce “controlled” error stopping points in your program. You can also
include an optional message that will help to clarify why you have
stopped the function.
First, however, let’s produce a simple example of using the
stop() function.
# Let's see what happens when we work with the log function
log10(1)
## [1] 0
log10(0)
## [1] -Inf
log10(...)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
Suppose we aren’t interested in producing -Inf or
NaN values? We can build a wrapper around the
log10 function with some conditional branching inside
it.
get.log10 = function(x) {
if(x <= 0) ...
log10(x)
}
get.log10(1) # test our function
## [1] 0
get.log10(-1) # Check it will stop when it's supposed to
## Error in get.log10(-1): '...' used in an incorrect context
get.log10(10) # Will this code run?
## [1] 1
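The same guard pattern applies to other inputs. As a sketch, the hypothetical function get.ratio() below checks its arguments first and calls stop() with a specific message for each problem, so the caller learns why execution halted.

```r
# Hypothetical helper: divide two numbers, refusing invalid input
get.ratio <- function(numerator, denominator) {
  if (!is.numeric(numerator) || !is.numeric(denominator)) {
    stop("both arguments must be numeric")
  }
  if (denominator == 0) {
    stop("denominator must not be zero")
  }
  numerator / denominator
}

get.ratio(6, 3)
# get.ratio(6, 0) would halt here with the message "denominator must not be zero"
```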
tryCatch() to identify errors without stopping
In our example above, calling stop()
halts the execution of our code. Instead, sometimes we may wish to note
that an error has occurred but also proceed with the remainder of
the code. In that case you can use the tryCatch() function
which takes on a somewhat complex structure.
The tryCatch() function can be used to run an expression
(or lines of code) and if an error or warning
is produced, it can catch the result without halting your program’s
execution. Additional message information can be produced in each case
so that the user can be warned of potential issues. Using
tryCatch() takes the form of:
func_name = function(input) {
out <- tryCatch({ ## This is where we try code that might fail
expression(s) },
warning = function(condition) {
## statements to execute upon warning
message("Optional consolidated warning message")
return() # optional return value
},
error = function(condition) {
## statements to execute upon error
message("Optional consolidated error message")
return() # optional return value
},
finally = {
## Code to complete regardless of an error
}
) ## End of tryCatch
return(out)
}
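Filling in that template with something runnable may help. The sketch below wraps as.numeric(), which raises a warning when text cannot be converted; to.number() is a hypothetical name, and returning NA from the handlers is just one possible choice.

```r
# Hypothetical helper: convert input to a number, reporting problems without halting
to.number <- function(input) {
  out <- tryCatch({
    as.numeric(input)   # the code that might warn or fail
  },
  warning = function(condition) {
    message("Could not convert input: ", conditionMessage(condition))
    return(NA_real_)
  },
  error = function(condition) {
    message("Conversion failed: ", conditionMessage(condition))
    return(NA_real_)
  },
  finally = {
    # Runs whether or not a condition was raised
    message("Finished attempt for input: ", input)
  })
  return(out)
}

to.number("3.5")   # converts cleanly
to.number("abc")   # triggers the warning handler and returns NA
```

The value returned by whichever handler runs becomes the value of the whole tryCatch() call, which is why out ends up as NA in the second case.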
Let’s focus again on our plotting functions we produced. Previously
our versions of save.facet.plot() included steps where the
input was being filtered - sometimes by sub-functions that should just
be producing a plot object. To remedy this we’ll go back to our rule of
“Do One Thing” and we’ll generate make.facet.plot() so that
its sole purpose is to produce a plot when given a filtered dataset
infection.data and a specific infection date
infDate.
# Simplify our main function which takes in pre-filtered data and plots it
# @Input
# infection.data: a filtered set of infection data that represents a single replicate date
# infDate: the actual replicate date that will be used in the title information
make.facet.plot = function(infection.data, infDate) {
infectionPlot <-
ggplot(infection.data) +
# 2. Aesthetics
aes(x = wormStrain, y = embryos, fill = wormStrain) +
theme_grey() +
# We'll need to rotate the x-axis text if we keep the figure size about 12" wide
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
ggtitle(paste0("Embryo counts on infection date: ", infDate)) + # plot title
# 4. Geoms
geom_boxplot() +
# 6. Facets
facet_grid(sporeStrain ~ doseLevel)
return(infectionPlot)
}
One of the things you can do as your functions and needs become more
complex is to nest functions within other functions. We’ve already
applied this when we call ggplot() functions within
save.facet.plot().
Next we want to generate a second function that will be able to
filter a set like embryos.df, call on
make.facet.plot(), and save the results as needed. In doing
so we simplify the debugging process and it will help when we begin to
incorporate a tryCatch() structure into our code.
# Make a function to filter data, make the plot, then save the plot
# @Input
# data.df: A dataset containing individual observations of embryo counts with at least:
# $wormStrain, $embryos, $infectionDate
# infDate: The infection date to filter data.df
#
# @Output
# List of 3 objects: 1) Filtered infection data
# 2) A ggplot object of the infection data
# 3) file name of saved plot
save.facet.plot = function(...) {
# filter the data
infection.data <- filter(data.df, infectionDate == infDate)
# make the plotted data
infection.plot <- make.facet.plot(infection.data, infDate)
# generate a save file name
save.file = paste(infDate, "graph.facet.saveFunction2.png", sep = "_")
# save the plot
ggsave(plot = infection.plot, filename = save.file, path = "data/" ,
scale=1, device = "png", units = c("in"))
#return the file name and data subset
return(list(infection.data, infection.plot, save.file))
}
# save a facet plot but look at the output of the plot
save.facet.plot(embryos.df, "200704")[[2]]
## Error in save.facet.plot(embryos.df, "200704"): object 'data.df' not found
Here’s where we need to get creative. What would happen inside
save.facet.plot() if we happened to forget to supply an
infDate argument to our call? Previously we included a
default value like “190423” but we have not done so here. Using a call like
save.facet.plot(embryos.df) will produce an error.
# save a facet plot but don't provide an infection date
save.facet.plot(...)[[2]]
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
tryCatch() series to try and capture your error
Instead of allowing the execution to halt when we reach an error,
maybe we can produce some messages and return a null value? In this
implementation we will return a NULL value for the user to
deal with.
# Make a function to filter data, make the plot, then save the plot
save.facet.plot = function(data.df, infDate) {
# Initiate a tryCatch and assign its output to a variable
out <- ...({
# filter the data
infection.data <- filter(data.df, infectionDate == infDate)
# What happens if we pick a non-existent infection date or if the parameter is not set?
},
error = function(c) { # The actual error information is passed in as the variable "c"
# Assume the error occurs when no infDate is provided
message("Error: potentially missing parameter information")
return(...)
# return NULL so we can recognize an error has occurred
# You're kind of a function inside a function.
# when you return, you'll leave to the next section.
}) # End tryCatch
if (...) {
# make the plotted data
infection.plot <- make.facet.plot(infection.data, infDate)
# generate a save file name
save.file = paste(infDate, "graph.facet.saveFunction.tryCatch.png", sep = "_")
# save the plot
ggsave(plot = infection.plot, filename = save.file, path = "data/" ,
scale=2, device = "png", units = c("cm"))
#return the file name and data subset
return(list(infection.data, infection.plot, save.file))
}
else return(out) # if it's an error we'll get the NULL
}
# save a facet plot but look at the output of the plot
save.facet.plot(embryos.df)[[2]]
## Error in ...({: could not find function "..."
tryCatch() to set values within your function
Suppose that instead of just returning a NULL value when we
produce an error, we change values on the user’s behalf and
continue? Of course our example here is in the context of an expected
error and we can’t always account for the nature of the error(s) we’ll
encounter. You could make things more complex and try to program some
statements to determine the error type!
In our example, we’ll try to anticipate the issue of a missing
infection date and “assume” that will be our only problem. We’ll take
advantage of the <<- scoping assignment operator. It
will search up the hierarchy of enclosing scopes until it finds the
specified variable and assign the value there. This happens in place of
R creating a new local variable.
Let’s modify the save.facet.plot() function so that our
error handler can set the infDate variable within
save.facet.plot().
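To see the difference between <- and <<- inside an error handler, here is a small sketch that is independent of the lecture data. The function pick.date() and the variables used.default and handler.note are hypothetical names made up for illustration; the fallback value “190423” is simply the default date mentioned earlier.

```r
# Hypothetical function: fall back to a default date when the input is unusable
pick.date <- function(infDate) {
  used.default <- FALSE        # local to pick.date()
  handler.note <- "unchanged"  # local to pick.date()

  chosen <- tryCatch({
    if (!is.character(infDate)) stop("infDate must be a character string")
    infDate
  },
  error = function(c) {
    # <<- finds used.default in the enclosing function and changes it there
    used.default <<- TRUE
    # <- creates a new variable that only exists inside this handler
    handler.note <- "changed"
    # The last value of the handler becomes the value of tryCatch()
    "190423"
  })

  list(chosen = chosen, used.default = used.default, handler.note = handler.note)
}

result.bad <- pick.date(200704)      # numeric input triggers the handler
result.good <- pick.date("200704")   # valid input, handler never runs
str(result.bad)
```

In result.bad the used.default flag is TRUE because the handler reached up a level with <<-, while handler.note still reads "unchanged" because the plain <- assignment never left the handler.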
# Make a function to filter data, make the plot, then save the plot
save.facet.plot = function(data.df, infDate) {
# Initiate a tryCatch and assign its output to a variable
out <- tryCatch({
# filter the data
infection.data <- filter(data.df, infectionDate == infDate)
# What happens if we pick a non-existent infection date or if the parameter is not set?
},
error = function(c) { # The actual error information is passed in as the variable "c"
# Assume the error occurs when no infDate is provided
message("Warning: No infection date provided")
message("substituting with a first-level value")
# Remember: we are in a mini function at the moment
# We need to go up a level and set infDate within the save.facet.plot function
# Set it to the first level of $infectionDate
infDate ... levels(data.df$infectionDate)[1]
# Then we'll remake the data subset
infection.data ... filter(data.df, infectionDate == infDate)
# This will allow us to proceed with the rest of the code as though nothing were amiss
# return(NULL)
# return NULL so we can recognize an error has occurred
# You're kind of a function inside a function.
# when you return, you'll leave to the next section.
}) # End tryCatch
# make the plotted data
infection.plot <- make.facet.plot(infection.data, infDate)
# generate a save file name
save.file = paste(infDate, "graph.facet.saveFunction.tryCatchv2.png", sep = "_")
# save the plot
ggsave(plot = infection.plot, filename = save.file, path = "data/" ,
scale=2, device = "png", units = c("cm"))
#return the file name and data subset
return(list(infection.data, infection.plot, save.file))
}
## Error: <text>:18:13: unexpected symbol
## 17: # Set it to the first level of $infectionDate
## 18: infDate ...
## ^
# save a facet plot but look at the output of the plot
save.facet.plot(embryos.df)[[2]]
## Error in ...({: could not find function "..."
Here’s an alternative version of our code that runs all of the code
within the tryCatch call using the finally
option.
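Before applying it to our plotting function, here is a stripped-down sketch of how the finally section behaves. The helper safe.log10() is a hypothetical name used only for illustration.

```r
# Hypothetical helper used only for illustration
safe.log10 <- function(x) {
  tryCatch({
    if (!is.numeric(x)) stop("x must be numeric")
    log10(x)
  },
  error = function(c) {
    # conditionMessage() extracts the text of the caught error
    message("Error caught: ", conditionMessage(c))
    NA_real_
  },
  finally = {
    # Runs after the main code or the handler, whether or not an error occurred
    message("Finished processing input")
  })
}

log.ok <- safe.log10(100)             # no error; finally still runs
log.bad <- safe.log10("one hundred")  # handler runs first, then finally
```

Note that in this sketch the value handed back by tryCatch() comes from the main expression or from the handler; the finally section is used only for its side effect of printing a message.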
# Make a function to filter data, make the plot, then save the plot
save.facet.plot = function(data.df, infDate) {
# Initiate a tryCatch and assign its output to a variable
out <- tryCatch({
# filter the data
infection.data <- filter(data.df, infectionDate == infDate)
# What happens if we pick a non-existent infection date or if the parameter is not set?
},
error = function(c) { # The actual error information is passed in as the variable "c"
# Assume the error occurs when no infDate is provided
message("Warning: No infection date provided")
message("substituting with a first-level value")
# Remember: we are in a mini function at the moment
# We need to go up a level and set infDate within the save.facet.plot function
# Set it to the first level of $infectionDate
infDate <<- levels(data.df$infectionDate)[1]
# Then we'll remake the data subset
infection.data <- filter(data.df, infectionDate == infDate)
# This will allow us to proceed with the rest of the code as though nothing were amiss
# return(NULL)
# return NULL so we can recognize an error has occurred
# You're kind of a function inside a function.
# when you return, you'll leave to the next section.
},
### Set up a finally section which runs regardless of whether or not an error has been caught
# You could put all or some of your end-code here depending on your needs
# We'll move all the post-catch code into here for fun
... = {
# make the plotted data
infection.plot <- make.facet.plot(infection.data, infDate)
# generate a save file name
save.file = paste(infDate, "graph.facet.saveFunction.tryCatchv2.png", sep = "_")
# save the plot
ggsave(plot = infection.plot, filename = save.file, path = "data/" ,
scale=2, device = "png", units = c("cm"))
#return the file name and data subset
return(list(infection.data, infection.plot, save.file))
}) # End tryCatch
}
# save a facet plot but look at the output of the plot
save.facet.plot(embryos.df)[[2]]
## Error in save.facet.plot(embryos.df): object 'infection.data' not found
So it looks like we’ve provided some leeway for the user in case they
fail to provide any sort of infection date to subset our data. What if,
however, they simply provide an incorrect date? Let’s see what the
result will be if we try to run our current version of
save.facet.plot with an incorrect date.
save.facet.plot(embryos.df, "...")[[2]]
## Error in save.facet.plot(embryos.df, "..."): object 'infection.data' not found
As is the case above, we’ve accounted for a lack of input when we
call on save.facet.plot but not for the situation where the
input provided is incorrect! If we wanted to add another layer of
protection, we’d have to include that, or add some flow control as
below!
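As a rough sketch of this kind of flow control on its own, assuming a made-up data frame (toy.df) and a hypothetical helper (check.subset()), we can test whether a subset has any rows before continuing:

```r
# Toy data standing in for the lecture's embryo counts (made-up values)
toy.df <- data.frame(infectionDate = c("190423", "200704", "200704"),
                     embryos = c(12, 8, 15),
                     stringsAsFactors = FALSE)

# Hypothetical helper: only hand back the subset if it contains rows
check.subset <- function(data.df, infDate) {
  infection.data <- data.df[data.df$infectionDate == infDate, ]
  if (nrow(infection.data) > 0) {
    return(infection.data)
  } else {
    # A date that matches nothing is not an error in R; it just gives 0 rows
    return("Warning: data filtering resulted in 0-row subset")
  }
}

found <- check.subset(toy.df, "200704")       # a data frame with matching rows
not.found <- check.subset(toy.df, "221019")   # a warning string instead
```

The key point is that an incorrect date never triggers the error handler, so tryCatch() alone cannot protect us from it.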
# Make a function to filter data, make the plot, then save the plot
save.facet.plot = function(data.df, infDate) {
# Initiate a tryCatch and assign its output to a variable
out <- tryCatch({
# filter the data
infection.data <- filter(data.df, infectionDate == infDate)
# What happens if we pick a non-existent infection date or if the parameter is not set?
},
error = function(c) { # The actual error information is passed in as the variable "c"
# Assume the error occurs when no infDate is provided
message("Error: No infection date provided")
message("substituting with a first-level value")
# Remember: we are in a mini function at the moment
# We need to go up a level and set infDate within the save.facet.plot function
infDate <<- levels(data.df$infectionDate)[1]
# Then we'll remake the data subset
infection.data <<- filter(data.df, infectionDate == infDate)
# This will allow us to proceed with the rest of the code as though nothing were amiss
}) # End the tryCatch having made a subset or default version if no infDate is supplied
# In the case of an INCORRECT infDate there are a couple of ways to go about doing it
# Check if our filtered data has any rows
if(...) {
infection.plot <- make.facet.plot(infection.data, infDate)
# generate a save file name
save.file = paste(infDate, "graph.facet.saveFunction.tryCatchv2.png", sep = "_")
# save the plot
ggsave(plot = infection.plot, filename = save.file, path = "data/" ,
scale=2, device = "png", units = c("cm"))
#return the file name and data subset
return(list(infection.data, infection.plot, save.file))
} else {
# filter results in 0-row subset
return(rep("Warning: data filtering resulted in 0-row subset", 3))
}
}
# save.facet.plot(embryos.df)[[2]]
save.facet.plot(embryos.df, "221019")[[2]]
## Error: No infection date provided
## substituting with a first-level value
## Warning in levels(data.df$infectionDate): restarting interrupted promise evaluation
## Error in eval(expr, envir, enclos): object 'embryos.df' not found
Comprehension Question 3.0.0: As you saw in the last code cell, we used flow control to determine the state of the filtered/subset data. There are other ways we could use flow control to ensure our functions will work as expected. Take a look at the code cell below and complete the flow control measures based on your understanding of how factors work!
Hint: How do you check your unfiltered data for different factor levels? Which variable will you query?
# comprehension answer code 3.0.0
# Make a function to filter data, make the plot, then save the plot
save.facet.plot.updated = function(data.df, infDate) {
# Initiate a tryCatch and assign its output to a variable
out <- tryCatch({
# filter the data
infection.data <- filter(data.df, infectionDate == infDate)
# What happens if we pick a non-existent infection date or if the parameter is not set?
},
error = function(c) { # The actual error information is passed in as the variable "c"
# Assume the error occurs when no infDate is provided
message("Error: No infection date provided")
message("substituting with a first-level value")
# Remember: we are in a mini function at the moment
# We need to go up a level and set infDate within the save.facet.plot function
infDate <<- levels(data.df$infectionDate)[1]
# Then we'll remake the data subset
infection.data <<- filter(data.df, infectionDate == infDate)
# This will allow us to proceed with the rest of the code as though nothing were amiss
return(NULL)
# return NULL so we can recognize an error has occurred
# You're kind of a function inside a function.
# when you return, you'll leave to the next section.
}) # End the tryCatch having made a subset or default version if no infDate is supplied
# In the case of an INCORRECT infDate, how can we tell if it isn't a possible value?
if(...) {
infection.plot <- make.facet.plot(infection.data, infDate)
# generate a save file name
save.file = paste(infDate, "graph.facet.saveFunction.tryCatchv2.png", sep = "_")
# save the plot
ggsave(plot = infection.plot, filename = save.file, path = "data/" ,
scale=1, device = "png", units = c("in"))
#return the file name and data subset
return(list(infection.data, infection.plot, save.file))
} else {
# filter results in 0-row subset
return(rep("Error: infDate provided is not a level of infectionDate", 3))
}
}
Error-catching for the data scientist: While error-catching can seem complicated, it provides all sorts of ways to help save on headaches when debugging your code. As your code increases in complexity you may want to learn more about these systems. You can find a well-written section on debugging your code and using error-handling by Hadley Wickham as well as some helpful examples.
Now that you have the basics, you can continue to build on complexity (or simplicity) as you need it.
Call your functions from anywhere once they are stored in memory.
While working within the R environment we’ve learned to manipulate data and save its output as text or Excel files. We’ve also learned to generate our own functions and save output as variables. When we create very useful functions and want to keep the code, there is no need to copy and paste it into every script we make either.
In this last section we will discover how we can import our own functions, save data objects, and load R workspaces into memory.
source()
As a final extension of our control flow lesson, you already know
about packages - these hold functions and data that are pre-made by
others within the R community. You normally install these with
install.packages() and then load them into memory with
library().
You don’t need to make your own packages to get similar capabilities using your customized functions. Instead, you can certainly make source files to keep functions and pertinent variables you may re-use in all of your analyses.
To access a saved “R” file which contains purely code (and
comments!), you can use the source() command. Let’s
try!
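If you would like to see the mechanics without relying on the lecture’s files, here is a small self-contained sketch. It writes a tiny script to a temporary file and then sources it; the names double.it and favourite.number are made up for illustration.

```r
# Write a small R script to a temporary file (stand-in for your own .R file)
script.path <- tempfile(fileext = ".R")
writeLines(c("# A function and a variable we want to reuse",
             "double.it <- function(x) {",
             "  x * 2",
             "}",
             "favourite.number <- 21"),
           con = script.path)

# Run every line of the script; its objects now exist in our workspace
source(script.path)

doubled <- double.it(favourite.number)
doubled
```

Everything defined in the sourced script, both the function and the variable, becomes available as though we had typed it ourselves.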
#?source
# Load data and information from another R script
source("...")
## Warning in file(filename, "r", encoding = encoding): cannot open file '...': Permission denied
## Error in file(filename, "r", encoding = encoding): cannot open the connection
ls() to find variables and functions
After loading your script into memory, you may want to see what is
available in your environment’s memory. The ls() command
allows you to see what is available but it does not discriminate between
objects or functions.
# See what variables and functions you have in memory
print(...)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
lsf.str()
As you can see from above, using ls() lists all the objects
currently saved in memory, including the functions we’ve previously
declared and possibly some new ones imported from our call to
source(). To see which functions we have loaded outside of
those from packages in memory, we can use lsf.str() to
list functions in memory! Let’s see what’s new
and try something out.
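To compare the two commands side by side, here is a short sketch that uses a separate, made-up environment (demo.env) so that it doesn’t depend on what is currently in your workspace. The object names are borrowed from earlier in the lecture but their contents here are arbitrary.

```r
# A small environment holding one variable and one function
demo.env <- new.env()
assign("numVector", c(1, 2, 3), envir = demo.env)
assign("get.log10", function(x) log10(x), envir = demo.env)

# ls() lists every name, without distinguishing variables from functions
all.names <- ls(demo.env)

# lsf.str() only reports the functions
function.names <- as.character(lsf.str(envir = demo.env))

all.names
function.names
```

Calling ls() or lsf.str() with no arguments does the same thing for your current workspace.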
# To see which functions are available in memory
...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Let's look at a new function from "Lecture07.R"
...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Look up newly added variables
codon_translation
## Error in eval(expr, envir, enclos): object 'codon_translation' not found
# Use codonToAA on a single codon
codonToAA("AUA")
## Error in codonToAA("AUA"): could not find function "codonToAA"
# Use codonToAA on multiple codons
codonToAA(c(...)) %>% str_flatten()
## Error in codonToAA(c(...)): could not find function "codonToAA"
save() objects or your whole kernel memory!
From time to time you may have objects from analyses that aren’t
easily translated back as data tables or Excel files. Perhaps you may
want to save objects or plots from a complex analysis for later use. You
can accomplish this with the save() command by providing a
list of one or more objects to save.
print(ls())
## [1] "a" "anything" "destinationDir" "dexType" "fname"
## [6] "fout" "get.log10" "gitCred" "githubDir" "i"
## [11] "lectureDir" "lectureName" "mainDir" "make.facet.plot" "make.power"
## [16] "n" "numVector" "originDir" "pokemon" "programmer"
## [21] "renderOut" "repo" "repoLocal" "repoURL" "result"
## [26] "save.facet.plot" "set.facet.plot" "temp" "termDir" "termGit"
## [31] "variable" "x" "y"
save(inf200704.plot, subdata, ...,
file="./data/Lecture07.RData") # Note the filetype we use to save data is "RData"
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
save.image() saves your entire workspace
Sometimes you just want to save everything in memory. This may be a safeguard against accidental errors after running long analyses. The same can be said about saving single objects but you may find this a useful command in the future.
# Save an image of everything to an RData file
...(file="./data/Lecture07.all.RData")
## Error in ...(file = "./data/Lecture07.all.RData"): could not find function "..."
load() .RData files into memory
When you’re finally ready to revisit your saved objects or memory,
you’ll want to restore them. It’s as easy as using the command
load(). Let’s demonstrate, but first we need to clean up
our current memory with rm().
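Here is a small round-trip sketch of the whole idea that is safe to run anywhere. It uses a temporary file and two made-up objects (codon.counts and plot.title), and it only removes those two objects rather than clearing everything.

```r
# Two made-up objects to save
codon.counts <- c(AUG = 3, UAA = 1)
plot.title <- "Embryo counts"

# Save them to a temporary .RData file
rdata.path <- tempfile(fileext = ".RData")
save(codon.counts, plot.title, file = rdata.path)

# Remove just these two objects from memory
rm(codon.counts, plot.title)
exists("codon.counts")

# load() restores the objects and invisibly returns their names
loaded.names <- load(rdata.path)
loaded.names
codon.counts
```

Note that load() restores objects under their original names, so it will silently overwrite anything in memory that shares a name with a saved object.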
# Clear memory
rm(list = ls())
# check that it's clear
print(ls())
## character(0)
# reload it all
...("./data/Lecture07.all.RData")
## Error in ...("./data/Lecture07.all.RData"): could not find function "..."
print(ls())
## character(0)
As we wrap up this section, let’s go back and run one of our previously imported functions!
# Let's try surprise.class()
surprise.class("...")
## Error in surprise.class("..."): could not find function "surprise.class"
While this is the end for us, it’s not the end for you!
Let’s review our time together. Over the span of this course we’ve discussed
dplyr package.
tidyverse package.
ggplot2 package.
stringr.
You now have the tools to accomplish quite a few tasks and the foundation to grow your skills as needed. Let’s run a final function together to celebrate!
# Time to run our final function together
...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
There is no post-lecture assessment this week. Your DataCamp accounts will continue to remain active for another ~4 months during which time you can choose to explore the site’s different courses. Please take advantage of this opportunity to keep growing your R skills!
However, we have created a post-course survey you can fill out anonymously. You can use this survey as an opportunity to tell us about your experience and help shape the future offerings of this series. Please take 5-10 minutes to fill out the survey. We really appreciate your feedback!
Anonymous Google Survey found here
At the end of this lecture a Quercus assignment portal will be available to submit an RMD version of your completed skeletons from today (including the comprehension question answers!). These will be due one week later, before the next lecture. Each lecture skeleton is worth 2% of your final grade but a bonus 0.5% will also be awarded for submissions made within 24 hours from the end of lecture (i.e. 1600 hours the following day). To save your notebook:
Your final project will be due two weeks after this lecture at 23:59 hours on Wednesday October 30th. Please submit your final assignment as a single compressed file which will include:
.zip file.
Please refer to the marking rubric found in this course’s root directory on the datatools RStudio for additional instructions.
You can also build your R Markdown Notebooks on the UofT JupyterHub and save/download the files to your personal computer for compressing before submitting on Quercus.
Any additional questions can be emailed to me or the TAs or posted to the Discussion section of Quercus. Best of luck!
Don’t forget to submit your term project!
Revision 1.0.0: materials prepared in R Markdown by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.1.0: edited and prepared for CSB1020H F LEC0142, 09-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.1.1: edited and prepared for CSB1020H F LEC0142, 09-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.1.2: edited and prepared for CSB1020H F LEC0142, 09-2023 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.2.0: edited and prepared for CSB1020H F LEC0142, 09-2024 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
This class is supported by DataCamp, the most intuitive learning platform for data science and analytics. Learn any time, anywhere and become an expert in R, Python, SQL, and more. DataCamp’s learn-by-doing methodology combines short expert videos and hands-on-the-keyboard exercises to help learners retain knowledge. DataCamp offers 350+ courses by expert instructors on topics such as importing data, data visualization, and machine learning. They’re constantly expanding their curriculum to keep up with the latest technology trends and to provide the best learning experience for all skill levels. Join over 6 million learners around the world and close your skills gap.
Your DataCamp academic subscription grants you free access to the DataCamp’s catalog for 6 months from the beginning of this course. You are free to look for additional tutorials and courses to help grow your skills for your data science journey. Learn more (literally!) at DataCamp.com.